Intro to the Tidyverse

Author

Brandon Le

Published

November 20, 2025

What is the Tidyverse?

The Tidyverse (https://www.tidyverse.org) is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The core packages included in the Tidyverse are:

  • ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics
  • dplyr: a grammar of data manipulation
  • readr: a fast and friendly way to read rectangular data (like csv, tsv, and fwf)
  • tibble: data.frames that are lazy and surly
  • tidyr: helps you create tidy data
  • stringr: a cohesive set of functions designed to make working with strings as easy as possible
  • forcats: a suite of tools that solve common problems with factors
  • lubridate: helps with date-time data
  • purrr: enhances R's functional programming (FP) toolkit
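If you ever need this list programmatically, the tidyverse meta-package exports a helper that returns the names of all bundled packages:

```r
library(tidyverse)

# tidyverse_packages() returns a character vector of the packages
# that ship with the tidyverse meta-package
tidyverse_packages()
```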

Installing the Tidyverse package

You can install R packages from several sources:

  • CRAN (Comprehensive R Archive Network)

    install.packages("tidyverse", Ncpus = 6)
  • Github

    # install devtools first if needed: install.packages("devtools")
    library(devtools)
    # https://github.com/tidyverse/tidyverse
    devtools::install_github("tidyverse/tidyverse") 
  • Source file (tar.gz)

    # path_to_file is the full path to the tar.gz file
    install.packages(path_to_file, repos = NULL, type = "source") 
  • RStudio (via the Tools > Install Packages menu)


Loading the packages

We will load the tidyverse and palmerpenguins packages. The palmerpenguins package contains a dataset we will use to explore the many functions within the tidyverse.

library(tidyverse)
library(palmerpenguins) # load penguins data

Palmerpenguins Dataset


Artwork by @allison_horst

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. Data from 344 penguins were recorded.

You can check out more data exploration and visualization with the palmerpenguins dataset here: palmerpenguins.

Let's explore the penguins dataset further.

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007
 5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
 6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
 7 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
 8 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
 9 Adelie  Torgersen           34.1          18.1               193        3475 <NA>    2007
10 Adelie  Torgersen           42            20.2               190        4250 <NA>    2007
# ℹ 334 more rows

The penguins dataset is stored in a tibble, the tidyverse's modern take on the data frame. This tibble contains 344 rows x 8 columns, and the header shows a type abbreviation for each variable:

  • int : integers
  • dbl : doubles or real numbers
  • chr : character or strings
  • fct : factors
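As a small illustration of these abbreviations, we can build a toy tibble (hypothetical data, not from penguins) and see how each column type is printed in the header:

```r
library(tibble)

# each column gets one of the type abbreviations listed above
toy <- tibble(
  count = c(1L, 2L),          # <int>
  mass  = c(3.75, 3.8),       # <dbl>
  label = c("a", "b"),        # <chr>
  group = factor(c("x", "y")) # <fct>
)
toy
```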

Let's examine the column data with glimpse().

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, …
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torge…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.7, 42.5, 34.4, 46.0, 37.8, 37.7, 35.9, 38.2, 38.8, 35.3, 40.6, 40.5, 37.9, 40.5, 39.5, 37.2, 39…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.0, 20.7, 18.4, 21.5, 18.3, 18.7, 19.2, 18.1, 17.2, 18.9, 18.6, 17.9, 18.6, 18.9, 16.7, 18.1, 17…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194, 174, 180, 189, 185, 180, 187, 183, 187, 172, 180, 178, 178, 188, 184, 195, 196, 190, 180, 181…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950, 3800, 3800, 3550, 3200, 3150, 3950, 3250, 3900, 33…
$ sex               <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, female, male, male, female, female, male, female, male, female, male, female, male, male, female, male, female, female, ma…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …

The column headers include:

species: Chinstrap, Gentoo, Adelie
island: Biscoe, Dream, Torgersen
year: 2007, 2008, 2009
sex: female, male
flipper_length_mm: flipper length (mm)
bill_length_mm: bill length (mm)
bill_depth_mm: bill depth (mm)
body_mass_g: body mass (g)

Artwork by @allison_horst

Artwork by @allison_horst

Importing Data with readr

The readr package has multiple methods to read in a data file depending on the file type.

  • read_csv(): comma-separated values (CSV)
  • read_tsv(): tab-separated values (TSV)
  • read_csv2(): semicolon-separated values with , as the decimal mark
  • read_delim(): delimited files (CSV and TSV are important special cases)
  • read_fwf(): fixed-width files
  • read_table(): whitespace-separated files
  • read_log(): web log files
These import functions work with local files AND external web data files. For external web data files, provide a URL instead of the file path.
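Reading a CSV from the web looks exactly like reading a local file; note the URL below is a placeholder, not a real dataset:

```r
library(readr)

# hypothetical URL: substitute the address of your own hosted CSV file
remote_data <- read_csv("https://example.com/data/penguins.csv", col_names = TRUE)
```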

Examples of different file types

  • Reading in Delimited files (e.g. “:”)
pg_delim <- read_delim(file = "data/penguins.txt", delim = ":", col_names = TRUE)
  • Reading in CSV files
pg_csv <- read_csv(file = "data/penguins.csv", col_names = TRUE)
  • Reading in TSV files
pg_tsv <- read_tsv(file = "data/penguins.tsv", col_names = TRUE)
  • Reading in Excel files
These files require the readxl package!

read_excel() will determine whether the file is in .xls or .xlsx format. If you know the specific extension, use read_xls() or read_xlsx() instead.

require(readxl)
pg_xls <- read_xlsx(path = "data/penguins.xlsx", sheet = NULL, col_names = TRUE)
  • Reading in Google Sheets files
These files require the googlesheets4 package!

You might see a message requesting authentication with the googlesheets4 package. Select 1 and follow the authorization process. You only need to do this once.

The googlesheets4 package is requesting access to your Google account. 
Enter '1' to start a new auth process or select a pre-authorized account. 

1: Send me to the browser for a new auth process. 
2: email@ucr.edu 
Selection:
require(googlesheets4)
URL <- "https://docs.google.com/spreadsheets/d/1dFh-U1P0PpJurRXpmXbDzFalLpZMsMn7HvjPZ--vznw/edit?usp=sharing"
pg_gsheet <- read_sheet(ss = URL, sheet = NULL, col_names = TRUE)

Data Wrangling/Transformation with dplyr

This section introduces the many functions of the dplyr package for data transformation. There are five key verbs in dplyr, plus group_by(), which changes the scope the verbs operate on:

  • Pick observations (i.e., rows) by their values (filter())
  • Reorder the rows (arrange())
  • Pick variables (i.e., columns) by their names (select())
  • Create new variables with functions of existing variables (mutate())
  • Collapse many values down to a single summary (summarize())
  • Apply functions by group (group_by())

We will be using these functions to explore the palmerpenguins dataset (above).

Filter rows with filter()

filter() allows you to subset observations based on their values.

# female penguins only
filter(penguins, sex == "female") 

# data collected from 2007 or 2008
filter(penguins, year == 2007 | year == 2008) 
filter(penguins, year %in% c(2007,2008))

# penguins with bill_length <= 40 and bill_depth >= 20
# (two equivalent forms, by De Morgan's laws)
filter(penguins, !(bill_length_mm > 40 | bill_depth_mm < 20))
filter(penguins, bill_length_mm <= 40, bill_depth_mm >= 20)

# penguins with bill_length > 45 & body_mass > 4000
filter(penguins, bill_length_mm > 45 & body_mass_g > 4000)

# remove rows containing NA in bill length
filter(penguins, !is.na(bill_length_mm))

Exercises

  1. How many male penguins have bill length > 50?
  2. How many penguins from the Adelie species were on the Biscoe island?

Solutions

Code
# 1.  How many male penguins have bill length > 50?
filter(penguins, sex == "male" & bill_length_mm > 50)

# 2.  How many penguins from the Adelie species were on the Biscoe island?
filter(penguins, species == "Adelie" & island == "Biscoe")

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

# sort penguins by sex, species, island
arrange(penguins, sex, species, island)

# sort penguins by bill length, in descending order
arrange(penguins, desc(bill_length_mm))

Exercises

  1. Sort the data by species, then by bill length (in descending order)
  2. Sort the data by island, then body mass (in descending order), then flipper length

Solutions

Code
# 1.  Sort the data by species, then by bill length (in descending order)
arrange(penguins, species, desc(bill_length_mm))

# 2.  Sort the data by island, then body mass (in descending order), then flipper length
arrange(penguins, island, desc(body_mass_g), flipper_length_mm)

Select columns with select()

select() allows you to rapidly zoom in on a useful subset of variables based on their names.

# select columns by name (e.g., species, bill length, and body mass)
select(penguins, species, bill_length_mm, body_mass_g)

# select all columns between species and bill depth (inclusive)
select(penguins, species:bill_depth_mm)

# select all columns except those from island to flipper length (inclusive)
select(penguins, -(island:flipper_length_mm))

# select the species column and all columns that begin with "bill"
select(penguins, species, starts_with("bill")) 

# select the species column and all columns that end with "mm"
select(penguins, species, ends_with("_mm"))

# select the species column and all columns with "length"
select(penguins, species, contains("length"))

# rename a variable (e.g. species to genera)
rename(penguins, genera = species)

Exercises

  1. Select the island, species, and all columns containing “th”
  2. Select just the columns containing measurements
  3. Remove the body_mass_g column from the table

Solutions

Code
# 1.  Select the island, species, and all columns containing "th"
select(penguins, island, species, contains("th"))

# 2.  Select just the columns containing measurements
select(penguins, bill_length_mm:body_mass_g)
select(penguins, -c(species, island, sex, year))

# 3.  Remove the body_mass_g column from the table
select(penguins, -(body_mass_g))

Add new column variables with mutate()

mutate() always adds new columns at the end of your dataset, so we'll start by creating a narrower dataset to make the new variables easier to see.

# create a subset of penguins data
penguins_sml <- select(penguins, 
  -c(island, year)
)

# create the variables flipper_length_cm and log10_body_mass_g
mutate(penguins_sml,
  flipper_length_cm = flipper_length_mm / 10,
  log10_body_mass_g = log10(body_mass_g) 
)

# create new variables using other variables
mutate(penguins_sml,
  ratio_bill_len_dep_mm = bill_length_mm / bill_depth_mm
)

# only display the new variables
transmute(penguins_sml,
  ratio_bill_len_dep_mm = bill_length_mm / bill_depth_mm
)

Exercises

  1. Create a new variable called index, where index is proportional to flipper length (mm) times the ratio of bill length (mm) to bill depth (mm)
  2. Create a new variable called bmi, where bmi is the index (from 1) divided by body mass (in kg)

Solutions

Code
# 1.  Create a new variable called index, where index is proportional to flipper length (mm) times the ratio of bill length (mm) to bill depth (mm)
mutate(penguins, index = flipper_length_mm * (bill_length_mm / bill_depth_mm))

# 2.  Create a new variable called bmi, where bmi is the index (from 1) divided by body mass (in kg)
mutate(penguins, 
       index = flipper_length_mm * (bill_length_mm / bill_depth_mm),
       bmi = index / (body_mass_g/1000))

Grouped summaries with summarise() and group_by()

summarise() collapses a data frame into a single row (or one row per group). Some useful summary functions include:

  • mean(x)
  • median(x)
  • sd(x) (standard deviation)
  • n() count
  • sum(!is.na(x)) count of non-missing values
  • n_distinct(x) count distinct values
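The counting helpers above can be combined in a single summarise() call; a quick sketch on the penguins data:

```r
# total rows, distinct species, non-missing bill lengths, and
# the standard deviation of body mass (ignoring missing values)
summarise(penguins,
  n_rows       = n(),
  n_species    = n_distinct(species),
  n_bill_obs   = sum(!is.na(bill_length_mm)),
  sd_body_mass = sd(body_mass_g, na.rm = TRUE)
)
```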

group_by() splits the data into groups based on one or more columns; subsequent summaries are then computed per group.

# mean bill length for all penguins surveyed
summarise(penguins, mean_bill_len = mean(bill_length_mm, na.rm = TRUE))

# mean bill length by species and island
species_island <- group_by(penguins, species, island)
summarise(species_island, mean_bill_len = mean(bill_length_mm, na.rm = TRUE))

# summarize by multiple conditions on grouped data (species, island):
# number of penguins, mean bill length, median flipper length, minimum body mass, maximum body mass
species_island <- group_by(penguins, species, island)
summarise(species_island, no_penguins = n(),
          mean_bill_len = mean(bill_length_mm, na.rm = TRUE),
          median_flipper_len = median(flipper_length_mm, na.rm = TRUE),
          min_body_mass_g = min(body_mass_g, na.rm = TRUE),
          max_body_mass_g = max(body_mass_g, na.rm = TRUE)
          )
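The intermediate species_island object can be avoided by chaining the steps with R's native pipe |>, which passes the left-hand result as the first argument of the next call:

```r
# same grouped summary as above, written as a pipeline
penguins |>
  group_by(species, island) |>
  summarise(mean_bill_len = mean(bill_length_mm, na.rm = TRUE))
```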

Exercises

  1. Summarize by species, the number of penguins and the average body mass
  2. Summarize by species and sex, the number of penguins and the average body mass

Solutions

Code
# 1.  Summarize by species, the number of penguins and the average body mass
species <- group_by(penguins, species)
summarise(species, no_penguins = n(), mean_body_mass = mean(body_mass_g, na.rm = TRUE))

# 2.  Summarize by species and sex, the number of penguins and the average body mass
species_sex <- group_by(penguins, species, sex)
summarise(species_sex, no_penguins = n(), mean_body_mass = mean(body_mass_g, na.rm = TRUE))

Useful Commands

These are a few handy commands that you will likely encounter when wrangling your data.

# removing all rows containing NA in any column
penguins |> na.omit()

# removing rows containing NA from specific columns (e.g., bill_length through sex)
filter(penguins, if_all(bill_length_mm:sex, ~ !is.na(.x)))
# older, superseded equivalent:
# filter_at(penguins, vars(bill_length_mm:sex), all_vars(!is.na(.)))

# renaming columns using select()
select(penguins, penguin_type = species, collection_year = year)

# write over previous column data
mutate(penguins, sex = case_when(
  sex == "female" ~ "F",
  sex == "male" ~ "M",
  TRUE ~ NA
))

# convert data type (year from int to char)
mutate(penguins, year = as.character(year))

Exporting files using readr

Similar to the functions used to import data into R, there are corresponding functions to export (i.e., write) data to files, depending on the output file type:

  • write_delim() : delimited files (CSV and TSV are important special cases)
  • write_csv() : comma-separated values
  • write_tsv() : tab-separated values (TSV)
  • write_excel_csv() : Excel-compatible CSV (UTF-8 with a byte-order mark)
  • write_sheet() : Google Sheets (requires the googlesheets4 package)

Examples of different file types

  • Writing to delimited file (e.g. “:”)
write_delim(object_name, file = "data/table.txt", delim = ":", col_names = TRUE)
  • Writing to CSV
write_csv(object_name, file = "data/table.csv", col_names = TRUE)
  • Writing to TSV
write_tsv(object_name, file = "data/table.tsv", col_names = TRUE)
  • Writing to an Excel-compatible CSV (note: write_excel_csv() writes a CSV file, not a native .xls/.xlsx file)
write_excel_csv(object_name, file = "data/table.csv", col_names = TRUE)
  • Writing to Google Sheets
write_sheet(object_name, ss = "googlesheet_name", sheet = NULL)
You can write to a compressed file by adding a compression extension (e.g., .gz, .zip, .bz2) to the filename:

write_csv(object_name, file = "data/table.csv.gz", col_names = TRUE)

Exercise: Wrangling a metadata file

In this section, we will wrangle a metadata file by doing the following:

  • Read in the file
  • Examine the data structure
  • Subset the data with the “Sample_ID”, “Treatment Group”, “Sequencing Depth (M)”, “Technician Name”, and “Sequencer_Platform”
  • Rename the column headings (replacing those with space or - to underscore)
  • For the Treatment Group, replace “High_Dose” to “high”, “Low_Dose” to “low”, “Control” to “control” and change the column name to “dosage”
  • For the Technician Name, convert “alice” to “Alice” and “BOB” to “Bob”
  • Create a total_cost column where cost is the sequencing depth * $100/M reads

From the data, address the following questions:

  • Summarize the total number of samples processed by each technician per sequencer platform

  • How many samples are suitable for further downstream analysis (requires a minimum of 35M reads per sample)

Solutions

Code
data_url <- "https://raw.githubusercontent.com/bioinformatics-workshop/Intro-to-Tidyverse-2025/refs/heads/main/data/metadata.csv"

metadata <- read_csv(data_url, col_names = TRUE)

metadata
glimpse(metadata)

metadata_subset <- metadata |>
  select(Sample_ID, `Treatment Group`, `Sequencing Depth (M)`, `Technician Name`, Sequencer_Platform) |>
  rename(dosage = `Treatment Group`,
         seq_depth_M = `Sequencing Depth (M)`,
         tech_name = `Technician Name`) |>
  mutate(dosage = case_when(
    dosage == "High_Dose" ~ "high",
    dosage == "Low_Dose" ~ "low",
    dosage == "Control" ~ "control",
    TRUE ~ NA
  )) |>
  mutate(tech_name = case_when(
    tech_name == "alice" ~ "Alice",
    tech_name == "BOB" ~ "Bob",
    TRUE ~ tech_name
  )) |>
  mutate(total_cost = seq_depth_M * 100)
  
# Addressing questions
# Summarize the total number of samples processed by each technician per sequencer platform
metadata_subset |>
  group_by(tech_name, Sequencer_Platform) |>
  summarise(no_samples = n())

# How many samples are suitable for further downstream analysis (requires a minimum of 35M reads per sample)
metadata_subset |>
  filter(seq_depth_M >= 35)

Additional Resources

R for Non-Programmers

Data Science Workshops (Harvard Business School and Institute for Quantitative Social Science)

Tidyverse Skills for Data Science

Session Info

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 8.10 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.15.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  utils     datasets  grDevices methods   base     

other attached packages:
 [1] palmerpenguins_0.1.1 lubridate_1.9.3      forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.5          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.1       
[11] tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.2    tidyselect_1.2.1  scales_1.3.0      yaml_2.3.10       fastmap_1.2.0     R6_2.5.1          generics_0.1.3    httr2_1.1.2       knitr_1.48        htmlwidgets_1.6.4
[13] ellmer_0.2.1      munsell_0.5.1     tzdb_0.4.0        pillar_1.9.0      rlang_1.1.4       utf8_1.2.4        stringi_1.8.4     xfun_0.49         S7_0.2.0          timechange_0.3.0  cli_3.6.3         withr_3.0.2      
[25] magrittr_2.0.3    digest_0.6.37     grid_4.4.2        rstudioapi_0.17.1 hms_1.1.3         rappdirs_0.3.3    lifecycle_1.0.4   coro_1.1.0        vctrs_0.6.5       evaluate_1.0.1    glue_1.8.0        fansi_1.0.6      
[37] colorspace_2.1-1  rmarkdown_2.29    tools_4.4.2       pkgconfig_2.0.3   htmltools_0.5.8.1